Towards an Understanding of Default Policies in Multitask Policy Optimization. (arXiv:2111.02994v3 [cs.LG] UPDATED)
Much of the recent success of deep reinforcement learning has been driven by
regularized policy optimization (RPO) algorithms, which achieve strong
performance across multiple domains. In this family of methods, agents are
trained to maximize
cumulative reward while penalizing deviation in behavior from some reference,
or default policy. In addition to empirical success, there is a strong
theoretical foundation for understanding RPO methods applied to single tasks,
with connections to natural gradient, trust region, and variational approaches.
However, there is limited formal understanding of desirable properties for
default policies in the multitask setting, an increasingly important domain as
the field shifts towards training more generally capable agents. Here, we take
a first step towards filling this gap by formally linking the quality of the
default policy to its effect on optimization. Using these results, we then
derive a principled RPO algorithm for multitask learning with strong
performance guarantees.
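
For concreteness, RPO objectives of the kind described above are commonly
written as a KL-regularized return. The following is a standard formulation,
not quoted from the paper itself, with discount factor $\gamma$, regularization
weight $\alpha > 0$, and default policy $\pi_0$:

$$ J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t \Big( r(s_t, a_t) \;-\; \alpha\,\mathrm{KL}\big(\pi(\cdot \mid s_t)\,\big\|\,\pi_0(\cdot \mid s_t)\big) \Big)\right] $$

Taking $\pi_0$ to be uniform recovers ordinary entropy regularization; the
multitask question the abstract raises is which choices of $\pi_0$, shared
across tasks, actually aid optimization.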